Matching a Set of Strings with Variable Length Don't Cares

Authors

  • Gregory Kucherov
  • Michaël Rusinowitch
Abstract

Given an alphabet $A$, a pattern $p$ is a sequence $(v_1, \ldots, v_m)$ of words from $A^*$ called keywords. We represent $p$ as a single word $v_1 @ \cdots @ v_m$, where $@ \notin A$ is a distinguished symbol called the variable length don't care symbol. Pattern $p$ is said to match a text $t \in A^*$ if $t = u_0 v_1 u_1 \cdots u_{m-1} v_m u_m$ for some $u_0, \ldots, u_m \in A^*$. In this paper we address the following problem: given a set $P$ of patterns and a text $t$, test whether one of the patterns of $P$ matches $t$. Quoting Fischer and Paterson in the concluding section of [10], "a good algorithm for this (problem) would have obvious practical applications". For instance, as reported by Manber and Baeza-Yates [13], the DNA pattern TATA often appears after the pattern CAATCT within a variable length space. It may therefore be interesting to look for the general pattern CAATCT@TATA. If we are given a set of such general patterns, it is desirable to have an algorithm that searches for all of them simultaneously instead of searching consecutively for each one. In this paper we propose an algorithm that solves the problem in time $O((|t| + |P|) \log |P|)$, where $|t|$ is the length of the text and $|P|$ is the total length of all keywords of $P$.

Several variants of the problem have been considered in the literature. Matching a set of strings with "unit length don't care symbols", which match any individual letter, was studied in [10, 15]. Bertossi and Logi [5] have proposed an efficient parallel algorithm for finding in a text the occurrences of a single pattern with variable length don't care symbols. Their algorithm has an $O(\log |t|)$ running time on $O(|t||P| / \log |t|)$ processors. Our problem can also be viewed as testing membership of a word in a regular language of type $\bigcup_{i=1}^{n} A^* u_1^i A^* u_2^i A^* \cdots A^* u_{m_i}^i A^*$. Note that any regular expression where the star operation only applies to the subexpression $A$ (i.e. the union of all letters) can be reduced to the above form by distributing concatenation over union. An $O(|t||E| / \log |t|)$ solution for the case of a general regular expression $E$ has been given by Myers [14].

The algorithm we propose here reads the text and the patterns from different tapes in the left-to-right fashion. The text is searched on-line, which means that the match is reported immediately after reading the shortest matched portion of the
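
The matching condition defined in the abstract can be illustrated with a short Python sketch. It is a naive baseline, not the algorithm proposed by the authors: it tests each pattern separately by greedily taking the leftmost occurrence of every keyword, so it does not achieve the $O((|t| + |P|) \log |P|)$ bound. The names `VLDC`, `match_end`, and `any_pattern_matches` are hypothetical and chosen only for this illustration.

```python
from typing import Iterable, Optional

VLDC = "@"  # variable length don't care symbol; assumed not to occur in the text


def match_end(pattern: str, text: str) -> Optional[int]:
    """Return the length of the shortest prefix of `text` matched by `pattern`.

    The pattern is written v1@v2@...@vm. Keywords are located left to right,
    each at its leftmost occurrence after the previous keyword ends, which
    minimises the end position. Returns None if the pattern does not match.
    """
    pos = 0
    for keyword in pattern.split(VLDC):
        hit = text.find(keyword, pos)
        if hit < 0:
            return None
        pos = hit + len(keyword)
    return pos


def any_pattern_matches(patterns: Iterable[str], text: str) -> bool:
    """Test whether at least one pattern of the set matches the text."""
    return any(match_end(p, text) is not None for p in patterns)


if __name__ == "__main__":
    text = "GGCAATCTGGGTATACC"
    print(match_end("CAATCT@TATA", text))
    print(any_pattern_matches(["TATA@CAATCT", "CAATCT@TATA"], text))
```

Here `match_end` returns the position at which an on-line search could first report a match for a single pattern. Running it once per pattern is exactly the consecutive search that the abstract aims to avoid.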


Similar Resources

Approximate String Matching with Variable Length Don't Care

Searching for DNA or amino acid sequences similar to a given pattern string is very important in molecular biology. In fact, a lot of programs and algorithms have been developed. Most of them are based on alignment of strings or approximate string matching. However, they do not seem to be adequate in some cases. For example, the DNA pattern TATA (known as TATA box) is a common promoter that oft...


On pattern matching with k mismatches and few don't cares

We consider the problem of pattern matching with k mismatches, where there can be don't care or wild card characters in the pattern. Specifically, given a pattern P of length m and a text T of length n, we want to find all occurrences of P in T that have no more than k mismatches. The pattern can have don't care characters, which match any character. Without don't cares, the best known algorith...


A faster algorithm for matching a set of patterns with variable length don't cares

We present a simple and faster solution to the problem of matching a set of patterns with variable length don't cares. Given an alphabet $\Sigma$, a pattern $p$ is a word $p_1 @ p_2 @ \cdots @ p_m$, where $p_i$ is a string over $\Sigma$ called a keyword and $@ \notin \Sigma$ is a symbol called a variable length don't care (VLDC) symbol. Pattern $p$ matches a text $t$ if $t = u_0 p_1 u_1 \cdots u_{m-}$...


Matching Integer Intervals by Minimal Sets of Binary Words with don't cares

We consider n-bit binary encodings of integers. An integer interval [p, q] can be considered as a set X of binary strings corresponding to encodings of all integers in [p, q]. The most natural encoding is the usual binary representation of integers (lexicographic encoding) and another useful encoding, considered in the paper, the reflected Gray code. A word w with a don’t care symbol is matchin...


Finding Patterns with Variable Length Gaps or Don't Cares

In this paper we have presented new algorithms to handle the pattern matching problem where the pattern can contain variable length gaps. Given a pattern $P$ with variable length gaps and a text $T$, our algorithm works in $O(n + m + \alpha \log(\max_{1 \le i \le l}(b_i - a_i)))$ time, where $n$ is the length of the text, $m$ is the summation of the lengths of the component subpatterns, $\alpha$ is the total number of occurrences ...




Publication date: 1995